Background

Phea is a framework for computing electronic [patient] phenotypes:

The intended use case is where you have a database with patient records and want to compute a formula. However, in principle, Phea’s approach need not be restricted to this use case.

Phea passes formulas unmodified to the SQL query, so they need not be restricted to mathematical operators. The formulas can include any SQL expression that would be valid in a SELECT statement, such as CASE WHEN ... expressions and function calls.

To use Phea, you need a DBI connection to the SQL server, generated by functions such as DBI::dbConnect() or DatabaseConnector::connect().

Phea also includes a function to create an interactive plot of the phenotype results in one patient. Below is an example of the body mass index formula.

Intuition

Phea goes over patient records in chronological order. Say you want to compute a formula:

body_mass_index = body_weight / (body_height * body_height)

Phea will compute body_mass_index for each person using the most recently available data at each point in time.

In this example, the formula depends on body_weight and body_height. Therefore, their timestamps will be the timestamps for the resulting body_mass_index. This means that at every point in time where there is a body_weight or body_height a record for a given patient, Phea will try to calculate body_mass_index again.

Suppose a patient had a body_weight = 75 kg measurement on May 1st, a body_height = 1.8 meters on August 1st, then a new body_weight = 79 kg on October 1st. Phea would produce 3 result rows as follows:

May 1st:
body_weight = 75 kg
body_height = NULL
body_mass_index = 75 / (NULL * NULL) = NULL

August 1st:
body_weight = 75 kg
body_height = 1.8 meters
body_mass_index = 75 / (1.8 * 1.8) = 23.14815

October 1st:
body_weight = 79 kg
body_height = 1.8 meters
body_mass_index = 79 / (1.8 * 1.8) = 24.38272

The algorithm continues as long as new records are avaialable. For example, if on November 1st a new record body_height = 1.78 meters appears, the result would be:

November 1st:
body_weight = 79 kg
body_height = 1.78 meters
body_mass_index = 75 / (1.78 * 1.78) = 24.93372

At each point in time, the most recently available data is used, unless you specify otherwise when creating a component.

How to set up

First, you need to provide Phea with your SQL connection. While Phea is not necessarily restricted to PostgreSQL, this is the only SQL flavor where it has been tested.

library(phea)

# Connect to SQL server
dbcon <- DBI::dbConnect(
    RPostgres::Postgres(),
    host = 'localhost',
    port = 7654,
    dbname = 'fort',
    user = cred$pg$user, # Credentials not declared in this file
    password = cred$pg$pass) # Credentials not declared in this file
    
# Setup Phea and define cdm_new_york3 as default schema.
setup_phea(
  connection = dbcon,
  schema = 'cdm_new_york3')

setup_phea() will save dbi_connection and def_schema for use by functions sql0() and sqlt(). Most importantly, however, it will install two user-defined functions on the SQL server (phea_coalesce_r_sfunc and phea_find_last_ignore_nulls), if they are not already there.

How to compute a formula

Each formula computation comes from combining three parts:

A record source is any SQL query that provides data you want to use. Its records are assigned a label rec_name. The record source is assumed to be a long table, that is, each row corresponds to one record. At a minimum, each row in a record source must have a patient identifier (pid) column and a timestamp (ts) column.

For exemple, suppose you have an OMOP Common Data Model schema called cdm_new_york3, and you would like to compute a formula involving a patient’s hemoglobin A1c level.
- Such hemoglobin records are in table measurement inside schema cdm_new_york3.
- They are identified by the OMOP CDM concept_id of 3004410 (corresponds to LOINC code 4548-4, Hemoglobin A1c/Hemoglobin.total in Blood).
- The patient identifiers are in column person_id
- The timestamps are in column measurement_datetime.
- In the formula, we want to refer to this value as “hemo_a1c”.

Let’s create the record source:

hemoglobin_query <- sql0('
  select * from cdm_new_york3.measurement
  where measurement_concept_id = 3004410
')

hemoglobin_record_source <- make_record_source(
  records = hemoglobin_query,
  ts = measurement_datetime,
  pid = person_id)

A component is a specification of what records to pick from record source. If you just want the most recently available record, the default values need not be overriden:

hemoglobin_component <- make_component(
  input_source = hemoglobin_record_source)

The hemoglobin_component can now be used to call calculate_formula(). For the sake of the example, let us suppose the phenotype phen that you want to compute is merely the square root of the patient’s hemoglobin A1c level.

phen = sqrt(hemoglobin A1c level)

The numeric value hemoglobin A1c is located in column value_as_number within the record source. The call to calculate_formula() is then:

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component),
  fml = 'sqrt(hemo_a1c_value_as_number)')

Phea automatically recognizes that hemo_a1c_value_as_number means column value_as_number in the SQL table associated with component hemo_a1c.

Object phen is a lazy table in the dplyr/dbplyr framework. It can be materialized into a tibble by calling collect(), or manipulated further using Phea or dplyr/dbplyr.

dplyr::collect(head(phen)) |> knitr::kable()
row_id pid ts window hemo_a1c_value_as_number value
1 1 2011-05-20 00:00:00 6.9 2.626785
2 1 2014-03-07 00:00:00 6.9 2.626785
3 1 2016-03-11 00:00:00 6.9 2.626785
4 1 2018-03-16 00:00:00 6.9 2.626785
5 1 2020-03-20 00:00:00 6.9 2.626785
6 1 2022-02-11 00:00:00 6.9 2.626785

To obtain just the SQL code, sql_render() can be used, or the option .clip_sql in calculate_formula():

dbplyr::sql_render(phen)
#> <SQL> SELECT
#>   "row_id",
#>   "pid",
#>   "ts",
#>   "window",
#>   "hemo_a1c_value_as_number",
#>   sqrt(hemo_a1c_value_as_number) AS "value"
#> FROM (
#>   SELECT
#>     *,
#>     "ts" - ts AS "window",
#>     last_value(row_id) over (partition by "pid", "ts") AS "ts_row"
#>   FROM (
#>     SELECT
#>       "row_id",
#>       "pid",
#>       "ts",
#>       MAX("hemo_a1c_value_as_number") OVER (PARTITION BY "..dbplyr_partion_1") AS "hemo_a1c_value_as_number",
#>       MAX("hemo_a1c_ts") OVER (PARTITION BY "..dbplyr_partion_2") AS "hemo_a1c_ts"
#>     FROM (
#>       SELECT
#>         *,
#>         SUM(CASE WHEN (("hemo_a1c_value_as_number" IS NULL)) THEN 0 ELSE 1 END) OVER (ORDER BY "pid", "ts" ROWS UNBOUNDED PRECEDING) AS "..dbplyr_partion_1",
#>         SUM(CASE WHEN (("hemo_a1c_ts" IS NULL)) THEN 0 ELSE 1 END) OVER (ORDER BY "pid", "ts" ROWS UNBOUNDED PRECEDING) AS "..dbplyr_partion_2"
#>       FROM (
#>         SELECT
#>           row_number() over () AS "row_id",
#>           "pid",
#>           "ts",
#>           last_value(case when "name" = 'gwixomba5v8j' then "value_as_number" else null end) over (partition by "pid", "name" order by "ts" rows between unbounded preceding and 0 preceding) AS "hemo_a1c_value_as_number",
#>           last_value(case when "name" = 'gwixomba5v8j' then "ts" else null end) over (partition by "pid", "name" order by "ts" rows between unbounded preceding and 0 preceding) AS "hemo_a1c_ts"
#>         FROM (
#>           SELECT
#>             'gwixomba5v8j' AS "name",
#>             "person_id" AS "pid",
#>             "measurement_datetime" AS "ts",
#>             "value_as_number"
#>           FROM (
#> 
#>   select * from cdm_new_york3.measurement
#>   where measurement_concept_id = 3004410
#> 
#>           ) "q01"
#>         ) "q02"
#>       ) "q03"
#>     ) "q04"
#>   ) "q05"
#> ) "q06"
#> WHERE ("row_id" = "ts_row")

Notice that the result of the sqrt(hemo_a1c_value_as_number) is in column value. This can be changed by giving a name to the formula in the call to calculate_formula():

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component),
  fml = list(
    a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)'))

dplyr::collect(head(phen)) |> knitr::kable()
row_id pid ts window hemo_a1c_value_as_number a1c_sqrt
1 1 2011-05-20 00:00:00 6.9 2.626785
2 1 2014-03-07 00:00:00 6.9 2.626785
3 1 2016-03-11 00:00:00 6.9 2.626785
4 1 2018-03-16 00:00:00 6.9 2.626785
5 1 2020-03-20 00:00:00 6.9 2.626785
6 1 2022-02-11 00:00:00 6.9 2.626785

Multiple formulas can be computed at once:

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component),
  fml = list(
    a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)',
    a1c_cubed = 'hemo_a1c_value_as_number * hemo_a1c_value_as_number * hemo_a1c_value_as_number'))

dplyr::collect(head(phen)) |> knitr::kable()
row_id pid ts window hemo_a1c_value_as_number a1c_sqrt a1c_cubed
1 1 2011-05-20 00:00:00 6.9 2.626785 328.509
2 1 2014-03-07 00:00:00 6.9 2.626785 328.509
3 1 2016-03-11 00:00:00 6.9 2.626785 328.509
4 1 2018-03-16 00:00:00 6.9 2.626785 328.509
5 1 2020-03-20 00:00:00 6.9 2.626785 328.509
6 1 2022-02-11 00:00:00 6.9 2.626785 328.509

If you need to use the result of one formula in the next one, set .cascaded = TRUE:

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component),
  fml = list(
    a1c_sqrt = 'sqrt(hemo_a1c_value_as_number)',
    a1c_half_sqrt = 'a1c_sqrt / 2'),
  .cascaded = TRUE)

dplyr::collect(head(phen)) |> knitr::kable()
row_id pid ts window hemo_a1c_value_as_number a1c_sqrt a1c_half_sqrt
1 1 2011-05-20 00:00:00 6.9 2.626785 1.313393
2 1 2014-03-07 00:00:00 6.9 2.626785 1.313393
3 1 2016-03-11 00:00:00 6.9 2.626785 1.313393
4 1 2018-03-16 00:00:00 6.9 2.626785 1.313393
5 1 2020-03-20 00:00:00 6.9 2.626785 1.313393
6 1 2022-02-11 00:00:00 6.9 2.626785 1.313393

Example with more components

So far the examples showed formulas with only one component. Suppose you would like to compute:
phen = sqrt([hemoglobin a1c]) * [patient's age at the time of measurement]

To obtain the patient’s age, we can collect the date of birth from table person, then create a component from it.

person_records <- make_record_source(
  records = sqlt(person),
  ts = birth_datetime,
  pid = person_id)

person_component <- make_component(person_records)

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component,
    person = person_component),
  fml = list(
    pat_age = 'age(hemo_a1c_measurement_datetime, person_birth_datetime)',
    a1c_times_age = 'hemo_a1c_value_as_number * extract(year from pat_age)'),
  .cascaded = TRUE)

dplyr::collect(head(phen)) |> knitr::kable()
row_id pid ts window hemo_a1c_measurement_datetime person_birth_datetime hemo_a1c_value_as_number pat_age a1c_times_age
8 1 1974-03-01 00:00:00 NA 1974-03-01 NA NA NA
1 1 2011-05-20 13594 days 2011-05-20 1974-03-01 6.9 37 years 2 mons 19 days 255.3
2 1 2014-03-07 14616 days 2014-03-07 1974-03-01 6.9 40 years 6 days 276.0
3 1 2016-03-11 15351 days 2016-03-11 1974-03-01 6.9 42 years 10 days 289.8
4 1 2018-03-16 16086 days 2018-03-16 1974-03-01 6.9 44 years 15 days 303.6
5 1 2020-03-20 16821 days 2020-03-20 1974-03-01 6.9 46 years 19 days 317.4

Phea includes in the result all data points used in the formula computation. If other columns from the same rows are wanted, export = can be used:

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component,
    person = person_component),
  fml = list(
    pat_age = 'age(hemo_a1c_measurement_datetime, person_birth_datetime)',
    a1c_times_age = 'hemo_a1c_value_as_number * extract(year from pat_age)'),
  .cascaded = TRUE,
  export = c(
    'hemo_a1c_provider_id',
    'hemo_a1c_visit_occurrence_id',
    'person_gender_concept_id'))

dplyr::collect(head(phen)) |> knitr::kable()
row_id pid ts window hemo_a1c_measurement_datetime person_birth_datetime hemo_a1c_value_as_number hemo_a1c_provider_id hemo_a1c_visit_occurrence_id person_gender_concept_id pat_age a1c_times_age
8 1 1974-03-01 00:00:00 NA 1974-03-01 NA NA NA 8507 NA NA
1 1 2011-05-20 13594 days 2011-05-20 1974-03-01 6.9 22084 21 8507 37 years 2 mons 19 days 255.3
2 1 2014-03-07 14616 days 2014-03-07 1974-03-01 6.9 27 24 8507 40 years 6 days 276.0
3 1 2016-03-11 15351 days 2016-03-11 1974-03-01 6.9 22084 19 8507 42 years 10 days 289.8
4 1 2018-03-16 16086 days 2018-03-16 1974-03-01 6.9 22084 18 8507 44 years 15 days 303.6
5 1 2020-03-20 16821 days 2020-03-20 1974-03-01 6.9 22084 17 8507 46 years 19 days 317.4

Using records other than the one most recently available

When creating a component, arguments line and delay can be used to pick records other than the one most recently available.

line causes records to be skipped in reverse chronological order. line = 0 (default) gives the most recent record available. line = 1 skips 1 record, that is, gives the second most recent. line = 2 skips 2 records, and so on.

hemoglobin_component <- make_component(
  input_source = hemoglobin_record_source)

previous_hemoglobin_component <- make_component(
  input_source = hemoglobin_record_source,
  line = 1)

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component,
    prev_hemo_a1c = previous_hemoglobin_component),
  fml = list(
    a1c_difference = 'hemo_a1c_value_as_number - prev_hemo_a1c_value_as_number'),
  export = c(
    'hemo_a1c_visit_occurrence_id',
    'prev_hemo_a1c_visit_occurrence_id'))
dplyr::collect(head(phen)) |> knitr::kable()
row_id pid ts window hemo_a1c_value_as_number prev_hemo_a1c_value_as_number hemo_a1c_visit_occurrence_id prev_hemo_a1c_visit_occurrence_id a1c_difference
1 1 2011-05-20 00:00:00 6.9 NA 21 NA NA
2 1 2014-03-07 1022 days 6.9 6.9 24 21 0
3 1 2016-03-11 735 days 6.9 6.9 19 24 0
4 1 2018-03-16 735 days 6.9 6.9 18 19 0
5 1 2020-03-20 735 days 6.9 6.9 17 18 0
6 1 2022-02-11 693 days 6.9 6.9 29 17 0

delay imposes a minimum time difference between the timestamp of the component and the timestamp of the phenotype. They are defined in SQL language. For example, delay = '1 day' gives the most recently available record that is at least 1 day older than the phenotype. If there are no such records, the value is empty.

hemoglobin_component <- make_component(
  input_source = hemoglobin_record_source)

old_hemoglobin_component <- make_component(
  input_source = hemoglobin_record_source,
  delay = '6 years')

phen <- calculate_formula(
  components = list(
    hemo_a1c = hemoglobin_component,
    old_hemo_a1c = old_hemoglobin_component),
  fml = list(
    six_year_diff = 'hemo_a1c_value_as_number - old_hemo_a1c_value_as_number'),
  export = c(
    'hemo_a1c_visit_occurrence_id',
    'old_hemo_a1c_visit_occurrence_id'))
dplyr::collect(head(phen)) |> knitr::kable()
row_id pid ts window hemo_a1c_value_as_number old_hemo_a1c_value_as_number hemo_a1c_visit_occurrence_id old_hemo_a1c_visit_occurrence_id six_year_diff
1 1 2011-05-20 00:00:00 6.9 NA 21 NA NA
2 1 2014-03-07 00:00:00 6.9 NA 24 NA NA
3 1 2016-03-11 00:00:00 6.9 NA 19 NA NA
4 1 2018-03-16 2492 days 6.9 6.9 18 21 0
5 1 2020-03-20 2205 days 6.9 6.9 17 24 0
6 1 2022-02-11 2898 days 6.9 6.9 29 24 0

As of now, line and delay cannot be used simultaneously in the same component. This is a planned feature. If both are provided, line is considered and delay is ignored. delay is only considered when line = NA.

Author contact

Contact Fabrício Kury at .